STATS 32 Session 7: Importing your own data and factors
Kenneth Tay
Oct 15, 2019
Reminder!
Project proposals are due TOMORROW 16 Oct (Wed), 23:59:59
- 1-2 paragraph description of project (dataset, problem of interest, potential visualizations)
- Submit text file on Canvas (a link to, or sample of, the dataset is preferred but not required)
Recap of week 3
- Transforming data with dplyr: select(), mutate(), arrange(), filter(), summarize(), group_by()
- Review of function syntax
- More data transformation with tidyr
Function syntax
The most important syntax in R is the function call. All R syntax has function calls underlying it.
A function call consists of:
- Function name
- Parentheses, and
- A list of arguments within the parentheses
Function example
## [1] NA
Function example
## [1] -1
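A minimal sketch of the three parts of a function call. The calls below are illustrative assumptions chosen to give the same printed values as the two examples above; they are not necessarily the calls used on the slides.

```r
# Function name: mean
# Parentheses: ( ... )
# Arguments: a numeric vector, plus an optional named argument na.rm
values <- c(1, NA, -3)

# With default arguments, a missing value propagates to the result
result_default <- mean(values)
print(result_default)

# Naming the na.rm argument drops the NA first: (1 + -3) / 2
result_na_rm <- mean(values, na.rm = TRUE)
print(result_na_rm)
```

Arguments can be matched by position (values) or by name (na.rm = TRUE); named arguments make the call easier to read.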
%>% syntax with dplyr
Take the mtcars dataset, select just the wt and mpg columns, then select rows with mpg < 15
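The description above could be written as a pipeline along these lines. This is a sketch that assumes the dplyr package is installed; the variable names low_mpg and nested are made up for illustration.

```r
library(dplyr)

# Read the pipe as "then": take mtcars, then select, then filter
low_mpg <- mtcars %>%
  select(wt, mpg) %>%
  filter(mpg < 15)
print(low_mpg)

# The same computation as nested function calls, read inside-out
nested <- filter(select(mtcars, wt, mpg), mpg < 15)
```

The pipe passes the result on its left as the first argument of the function on its right, so both versions give the same data frame.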
+ syntax with ggplot2
Agenda for today
- Importing data with readr
- Factors
“Official” cheat sheet for readr available here.
Where does your data live?
- Data is stored in a “file”
  - e.g. .txt (text) or .csv (comma-separated values) file
- Files belong to a “folder”/“directory”
- Folders can be nested within other folders
Filepath example (Mac)
- Application: Finder
- Default “root” directory: /
Filepath example (Windows)
- Application: File Explorer
- Default “root” directory: C:/
When I download a file, where does it go?
In Chrome: go to chrome://settings/downloads to find out
File paths
- A character string that tells you the location of a file
- Absolute path: starts from the “root” directory
  - e.g. /Users/kjytay/Downloads/datafile.csv
- Relative path: starts from the current directory (denoted by .)
  - e.g. If I am in the folder /Users/kjytay: ./Downloads/datafile.csv
  - e.g. If I am in the folder /Users/kjytay/Downloads: ./datafile.csv or simply datafile.csv
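A small sketch of how these paths fit together, using the example path from above. It only manipulates the path strings, so the file does not need to exist; the variable names are made up for illustration.

```r
# Absolute path from the example above
abs_path <- "/Users/kjytay/Downloads/datafile.csv"

# Split the path into its folder and its file name
folder <- dirname(abs_path)
file_name <- basename(abs_path)
print(folder)
print(file_name)

# Relative path as seen from the folder /Users/kjytay
rel_path <- file.path(".", "Downloads", "datafile.csv")
print(rel_path)

# Joining the current folder with the rest of the path
# rebuilds the absolute path
rebuilt <- file.path("/Users/kjytay", "Downloads", "datafile.csv")
print(rebuilt)
```

file.path() joins its arguments with /, which R accepts on both Mac and Windows.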
Working directories in R
- Directory where R looks for files that you ask it to load
- Also where R will put any files that you ask it to save
- You can see your current working directory at the top of the console or by typing getwd()
How can I change my working directory in RStudio?
- You can issue the command setwd("<path of new directory>")
- In the menu bar, click Session > Set Working Directory, then click one of the options in the sub-menu
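A sketch of how the working directory affects saving and loading files. It assumes a temporary folder as a stand-in for a real project folder, and example.csv is a hypothetical file name. Base R's write.csv() and read.csv() are used to keep the example dependency-free; readr's read_csv() resolves relative paths against the working directory in the same way.

```r
# Remember where we started so we can go back afterwards
old_wd <- getwd()

# Switch to a temporary folder (stand-in for a real project folder)
setwd(tempdir())

# A bare file name is a relative path: the file is saved in,
# and loaded from, the current working directory
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
file_found <- file.exists("example.csv")
dat <- read.csv("example.csv")
print(dat)

# Restore the original working directory
setwd(old_wd)
```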
Factors
- R’s data type for categorical data
- Useful for working with categorical variables: variables that have a fixed and known set of possible values
Why use factor variables instead of character variables?
Reason 1: Character variables don’t protect you from typos
Reason 2: Character variables don’t sort in a useful way
## [1] "Apr" "Dec" "Jan" "Mar"
Factor variables can fix both of these easily.
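Both problems can be seen with plain character vectors. This is a sketch; the vector names are made up, and "Jam" stands in for a typo of "Jan".

```r
months_ok <- c("Dec", "Apr", "Jan", "Mar")
months_typo <- c("Dec", "Apr", "Jam", "Mar")

# Reason 1: a character vector accepts any string,
# so the typo "Jam" is stored without any complaint
flagged <- sum(is.na(months_typo))
print(flagged)

# Reason 2: sorting is alphabetical, not calendar order
sorted_chr <- sort(months_ok)
print(sorted_chr)
```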
How to convert a character variable to a factor variable?
- Use factor() (in base R) or parse_factor() (from the readr package)
- Give the function the list of valid categories, or levels
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- c("Dec", "Apr", "Jam", "Mar")
factor(x, levels = month_levels)
## [1] Dec Apr <NA> Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
parse_factor(x, levels = month_levels)
## Warning: 1 parsing failure.
## row col expected actual
## 3 -- value in level set Jam
## [1] Dec Apr <NA> Mar
## attr(,"problems")
## # A tibble: 1 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA value in level set Jam
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(factor(x, levels = month_levels))
## [1] Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
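Because factor() silently turns unknown values into NA, it can help to check which entries failed to match a level. The sketch below uses base R only; the names bad_values, x_fixed and f_fixed are made up for illustration, and correcting "Jam" to "Jan" is an assumption about what the typo was meant to be.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- c("Dec", "Apr", "Jam", "Mar")
f <- factor(x, levels = month_levels)

# Entries that were not missing originally but became NA
# are values outside the level set
bad_values <- x[is.na(f) & !is.na(x)]
print(bad_values)

# Fix the typo, convert again, and sort in calendar order
x_fixed <- x
x_fixed[x_fixed == "Jam"] <- "Jan"
f_fixed <- factor(x_fixed, levels = month_levels)
sorted_months <- sort(f_fixed)
print(sorted_months)
```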
Today’s dataset: NBA data
Factors under the hood
- Under the hood, the different levels of a factor are assigned numbers 1, 2, …
- The first level gets assigned the number 1, the next is 2, and so on
## [1] Dec Apr <NA> Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## Factor w/ 12 levels "Jan","Feb","Mar",..: 12 4 NA 3
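The integer codes can be inspected directly. A minimal sketch using the same x and month_levels as earlier in the session; the names codes and labels are made up for illustration.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- c("Dec", "Apr", "Jam", "Mar")
f <- factor(x, levels = month_levels)

# Compact view: number of levels plus the integer codes
str(f)

# The codes are positions in the vector of levels
codes <- as.integer(f)
print(codes)

# Looking the codes up in the levels gives back the labels
labels <- levels(f)[codes]
print(labels)
```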
Ordered & unordered factors
- Factors can be ordered or unordered
- Example of ordered factor: T-shirt sizes (S < M < L)
- Example of unordered factor: ice-cream flavors
- R lets you specify whether a factor is ordered through the ordered argument of factor()
- For most purposes it doesn’t matter which you use; the main exception is model building, where ordered and unordered factors are treated differently
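A sketch of the T-shirt example with the ordered argument. The vector of sizes is made up for illustration.

```r
sizes <- c("M", "S", "L", "S")
size_levels <- c("S", "M", "L")

# Unordered (the default) and ordered versions of the same data
unordered_sizes <- factor(sizes, levels = size_levels)
ordered_sizes <- factor(sizes, levels = size_levels, ordered = TRUE)

print(is.ordered(unordered_sizes))
print(ordered_sizes)  # levels print as S < M < L

# Ordered factors support comparisons and max()/min();
# for unordered factors, < is not meaningful
smaller_than_L <- ordered_sizes < "L"
largest <- max(ordered_sizes)
print(smaller_than_L)
print(largest)
```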